Programming WCF Services : Queued Services - Playback Failures

3/1/2012 11:19:41 AM

Even after successful delivery, a message may still fail during playback to the service. Such failures typically abort the playback transaction, which causes the message to return to the service queue. WCF will then detect the message in the queue and retry. If the next call fails too, the message will go back to the queue again, and so on. Continuously retrying this way is often unacceptable. If the initial motivation for the queued service was load leveling, WCF’s auto-retry behavior will generate considerable stress on the service. You need a smart failure-handling schema that deals with the case when the call never succeeds (and, of course, defines “never” in practical terms). The failure handling will determine after how many attempts to give up, after how long to give up, and even the interval at which to try. Different systems need different retry strategies and have different sensitivity to the additional thrashing and probability of success. For example, retrying 10 times with a single retry once every hour is not the same strategy as retrying 10 times at 1-minute intervals, or the same as retrying 5 times, with each attempt consisting of a batch of 2 successive retries separated by a day. In general, it is better to hedge your bets on the causes for the failure and the probability of future success by retrying in a series of batches, to deal with sporadic and intermediate infrastructure issues as well as fluctuating application state. A series of batches, each batch comprised of a set number of retries in rapid succession, may just be able to catch the system in a state that will allow the call to succeed. If it doesn’t, deferring some of the retries to a future batch allows the system some time to recuperate. Additionally, once you have given up on retries, what should you do with the failed message, and what should you acknowledge to its sender?

1. Poison Messages

Transactional messaging systems are inherently susceptible to repeated failure, because the retries thrashing can bring the system to its knees. Messages that continuously fail playbacks are referred to as poison messages, because they literally poison the system with futile retries. Transactional messaging systems must actively detect and eliminate poison messages. Since there is no telling whether just one more retry might actually succeed, you can use the following simple heuristic: all things being equal, the more the message fails, the higher the likelihood is of it failing again. For example, if the message has failed just once, retrying seems reasonable. But if the message has already failed 1,000 times, it is very likely it will fail again the 1,001st time, so it is pointless to try again. In this case, the message should be deemed a poison message. What exactly constitutes “pointless” (or just wasteful) is obviously application-specific, but it is a configurable decision. MsmqBindingBase offers a number of properties governing the handling of playback failures:

public abstract class MsmqBindingBase : Binding,...
{
   //Poison message handling
   public int ReceiveRetryCount
   {get;set;}

   public int MaxRetryCycles
   {get;set;}

   public TimeSpan RetryCycleDelay
   {get;set;}

   public ReceiveErrorHandling ReceiveErrorHandling
   {get;set;}

   //More members
}

2. Poison Message Handling in MSMQ 4.0

With MSMQ 4.0 (available on Windows Vista, Windows Server 2008, and Windows 7 or later), WCF retries playing back a failed message in series of batches, for the reasoning just presented. WCF provides each queued endpoint with a retry queue and an optional poison messages queue. After all the calls in the batch have failed, the message does not return to the endpoint queue. Instead, it goes to the retry queue (WCF will create that queue on the fly). Once the message is deemed poisonous, you may have WCF move that message to the poison queue.

2.1. Retry batches

In each batch, WCF will immediately retry for ReceiveRetryCount times after the first call failure. ReceiveRetryCount defaults to five retries, or a total of six attempts, including the first attempt. After a batch has failed, the message goes to the retry queue. After a delay of RetryCycleDelay minutes, the message is moved from the retry queue to the endpoint queue for another retry batch. The retry delay defaults to 30 minutes. Once that batch fails, the message goes back to the retry queue, where it will be tried again after the delay has expired. Obviously, this cannot go on indefinitely. The MaxRetryCycles property controls how many batches at the most to try. The default of MaxRetryCycles is two cycles only, resulting in three batches in total. After MaxRetryCycles number of retry batches, the message is considered a poison message.

When configuring nondefault values for MaxRetryCycles, I recommend setting its value in direct proportion to RetryCycleDelay. The reason is that the longer the delay is, the more tolerant your system will be of additional retry batches, because the overall stress will be somewhat mitigated (having been spread over a longer period of time). With a short RetryCycleDelay you should minimize the number of allowed batches, because you are trying to avoid approximating continuous thrashing.

Finally, the ReceiveErrorHandling property governs what to do after the last retry fails and the message is deemed poisonous. The property is of the enum type ReceiveErrorHandling, defined as:

public enum ReceiveErrorHandling
{
   Fault,
   Drop,
   Reject,
   Move
}

2.2. ReceiveErrorHandling.Fault

The Fault value considers the poison message as a catastrophic failure and actively faults the MSMQ channel and the service host. Doing so prevents the service from processing any other messages, be they from a queued client or a regular connected client. The poison message will remain in the endpoint queue and must be removed from it explicitly by the administrator or by some compensating logic, since WCF will refuse to process it again if you merely restart the host. In order to continue processing client calls of any sort, you must open a new host (after you have removed the poison message from the queue). While you could install an error-handling extension to do some of that work, in practice there is no avoiding involving the application administrator.

ReceiveErrorHandling.Fault is the default value of the ReceiveErrorHandling property. With this setting, no acknowledgment of any sort is sent to the sender of the poison message. ReceiveErrorHandling.Fault is both the most conservative poison message strategy and the least useful from the system perspective, since it amounts to a stalemate.

2.3. ReceiveErrorHandling.Drop

The Drop value, as its name implies, silently ignores the poison message by dropping it and having the service keep processing other messages. You should configure for ReceiveErrorHandling.Drop if you have high tolerance for both errors and retries. If the message is not crucial (i.e., it is used to invoke a nice-to-have operation), dropping and continuing is acceptable. In addition, while ReceiveErrorHandling.Drop does allow for retries, conceptually you should not have too many retries—if you care that much about the message succeeding, you should not just drop it after the last failure.

Configuring for ReceiveErrorHandling.Drop also sends an ACK to the sender, so from the sender’s perspective, the message was delivered and processed successfully. For many applications, ReceiveErrorHandling.Drop is an adequate choice.

2.4. ReceiveErrorHandling.Reject

The ReceiveErrorHandling.Reject value actively rejects the poison message and refuses to have anything to do with it. Similar to ReceiveErrorHandling.Drop, it drops the message, but it also sends a NACK to the sender, thus signaling ultimate delivery and processing failure. The sender responds by moving the message to the sender’s dead-letter queue. ReceiveErrorHandling.Reject is a consistent, defensive, and adequate option for the vast majority of applications (yet it is not the default, to accommodate MSMQ 3.0 systems as well).

2.5. ReceiveErrorHandling.Move

The ReceiveErrorHandling.Move value is the advanced option for services that wish to defer judgment on the failed message to a dedicated third party. ReceiveErrorHandling.Move moves the message to the dedicated poison messages queue, and it does not send back an ACK or a NACK. Acknowledging processing of the message will be done after it is processed from the poison messages queue. While ReceiveErrorHandling.Move is a great choice if indeed you have some additional error recovery or compensation workflow to execute in case of a poison message, a relatively smaller set of applications will find it useful, due to its increased complexity and intimate integration with the system.

2.6. Configuration sample

Example 1 shows a configuration section from a host config file, configuring poison message handling on MSMQ 4.0.

Example 1. Poison message handling on MSMQ 4.0

<bindings>
   <netMsmqBinding>
      <binding name = "PoisonMessageHandling"
         receiveRetryCount    = "2"
         retryCycleDelay      = "00:05:00"
         maxRetryCycles       = "2"
         receiveErrorHandling = "Move"
      />
   </netMsmqBinding>
</bindings>

Figure 1 illustrates graphically the resulting behavior in the case of a poison message.

Figure 1. Poison message handling of Example 1

2.7. Poison message service

Your service can provide a dedicated poison message–handling service to handle messages posted to its poison messages queue when the binding is configured with ReceiveErrorHandling.Move. The poison message service must be polymorphic with the service’s queued endpoint contract. WCF will retrieve the poison message from the poison queue and play it to the poison service. It is therefore important that the poison service does not throw unhandled exceptions or abort the playback transaction . Such a poison message service typically engages in some kind of compensating work associated with the failed message, such as refunding a customer for a missing item in the inventory. Alternatively, a poison service could do any number of things, including notifying the administrator, logging the error, or just ignoring the message altogether by simply returning.

The poison message service is developed and configured like any other queued service. The only difference is that the endpoint address must be the same as the original endpoint address, suffixed by ;poison. Example 9-20 demonstrates the required configuration of a service and its poison message service. In Example 2 , the service and its poison message service share the same host process, but that is certainly optional.

Example 2. Configuring a poison message service

<system.serviceModel>
   <services>
      <service name  = "MyService">
         <endpoint
            address  = "net.msmq://localhost/private/MyServiceQueue"
            binding  = "netMsmqBinding"
            bindingConfiguration = "PoisonMesssageSettings"
            contract = "IMyContract"
         />
      </service>
      <service name = "MyPoisonServiceMessageHandler">
         <endpoint
            address  = "net.msmq://localhost/private/MyServiceQueue;poison"
            binding  = "netMsmqBinding"
            contract = "IMyContract"
         />
      </service>
   </services>
   <bindings>
      <netMsmqBinding>
         <binding name = "PoisonMesssageSettings"
            receiveRetryCount    = "..."
            retryCycleDelay      = "..."
            maxRetryCycles       = "..."
            receiveErrorHandling = "Move"
         />
      </netMsmqBinding>
   </bindings>
</system.serviceModel>

Receive Context

You should avoid having two or more service endpoints monitoring the same queue, since this will result in them processing each other’s messages. You may be tempted, however, to leverage this behavior as a load balancing of sort: you can deploy your service on multiple machines while having all the machines share the same queue. The problem in this scenario is poison message handling. It is possible for one service to return the message to the queue after an error for a retry, and then have a second service start processing that message, not knowing it was already tried or how many times. I believe the fundamental problem here is not with load balancing queued calls and playback errors; rather, it is with the need to load balance the queued calls in the first place. Load balancing is done in the interest of scalability and throughput. Both scalability and throughput have a temporal quality—they imply a time constraint on the level of performance of the service, and yet queued calls, by their very nature, indicate that the client does not care exactly when the calls execute.

Nonetheless, to enable multiple services to share a queue and manage playback errors, WCF defines the helper class ReceiveContext, defined as:

public abstract class ReceiveContext
{
   public virtual void Abandon(TimeSpan timeout);
   public virtual void Complete(TimeSpan timeout);

   public static bool TryGet(Message message,
                             out ReceiveContext property);
   public static bool TryGet(MessageProperties properties,
                              out ReceiveContext property);
   //More members
}

You enable the use of ReceiveContext with the ReceiveContextEnabled attribute:

public sealed class ReceiveContextEnabledAttribute : Attribute,
                                                     IOperationBehavior
{
   public bool ManualControl
   {get;set;}

   //More members
}

After a failure, you can use ReceiveContext to lock the message in the queue and prevent other services from processing it. This, however, results in a cumbersome programming model that is nowhere near as elegant as the transaction-driven queued services. I recommend you design the system so that you do not need to load balance your queued services and that you avoid ReceiveContext altogether.

3. Poison Message Handling in MSMQ 3.0

With MSMQ 3.0 (available on Windows XP and Windows Server 2003), there is no retry queue or optional poison queue. As a result, WCF supports at most a single retry batch out of the original endpoint queue. After the last failure of the first batch, the message is considered poisonous. WCF therefore behaves as if MaxRetryCycles is always set to 0, and the value of RetryCycleDelay is ignored. The only values available for the ReceiveErrorHandling property are ReceiveErrorHandling.Fault and ReceiveErrorHandling.Drop. Configuring other values throws an InvalidOperationException at the service load time.